library(mosaic)
library(tidyverse)
library(pander)
library(DT)
library(ggrepel)
library(plotly)
library(dplyr)
library(ggplot2)
library(maps)
library(tmap)
library(leaflet)
library(htmltools)
library(car)
library(mosaicData)
library(ResourceSelection)
library(reshape2)
library(RColorBrewer)
library(scatterplot3d)
library(readr)
library(prettydoc)
library(knitr)
library(kableExtra)
library(formattable)
library(haven)
library(rmarkdown)
Click between each section to look at our Concepts or the Application in our Weather Prediction Analysis.
Residual Concepts
What is a Residual?
A residual is just the difference between an observed value and its predicted value:
\[r_i = Y_i - \hat{Y_i}\]
Think of it as “how far off was my prediction of that jar of jelly beans?”
Click between tabs for further explanations
Why do residuals matter?
When looking at residual plots, you want to see points scattered randomly - like if someone threw a bunch of marbles on the floor (accidentally of course). If you see clear patterns, something might be wrong with your model.
Real Life Comparison for Residuals
Imagine you're baking cookies, and your recipe predicts that each batch takes 12 minutes:
The residual is how far off your actual baking time was from the predicted 12 minutes. Sometimes it’s over, sometimes under, and sometimes exactly right (just depends on how burnt you like your cookies, JK).
This helps you see how accurate your recipe’s timing prediction is for each batch of cookies.
data <- data.frame(
BatchNumber = 1:10,
ActualTime = c(13, 11, 12, 14, 10, 12, 15, 9, 13, 11),
PredictedTime = rep(12, 10)
)
# Fit the linear model
model <- lm(ActualTime ~ BatchNumber, data = data)
data$PredictedValue <- predict(model)
# Create a plot with residuals visually connected
ggplot(data, aes(x = BatchNumber, y = ActualTime)) +
geom_point(color = "pink", size = 3) + # Data points
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "gray") + # Regression line
geom_segment(aes(x = BatchNumber, y = ActualTime, xend = BatchNumber, yend = PredictedValue),
color = "pink", linetype = "solid", linewidth = 0.8) + # Residual lines
labs(title = "Linear Regression: Actual Baking Time vs Batch Number",
x = "Batch Number", y = "Actual Baking Time (minutes)") +
theme_minimal()
What is a Sum of Squares Error (SSE)?
The SSE is the measurement of how much the residuals (observed value minus predicted value) deviate from the regression line. This can also be explained as the amount of variability that is NOT explained by the model.
This is calculated with the following formula:
\[SSE = \underbrace{\sum_{i=1}^n}_\text{The sum of} (\underbrace{Y_i}_\text{Observed Value(The Dots)} - \underbrace{\hat{Y_i}}_\text{Predicted Value (The Line)})^2 \]
Why does the SSE matter?
We want these squared differences to be small compared to how much the y-values vary overall (SSTO).
Real Life Comparison of SSE
Think of predicting how long it takes to drive to work:
Your actual drive times vary (maybe 20, 25, or 30 minutes depending on things like traffic, how fast you drive, who knows?), but your prediction model says it always takes 23 minutes.
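The drive-time idea can be sketched in a few lines of R. The numbers below are made up for illustration (they are not data from this study): a constant 23-minute prediction compared against some hypothetical actual drive times.

```r
# Hypothetical actual drive times (minutes) vs. a constant 23-minute prediction
actual    <- c(20, 25, 30, 22, 28)
predicted <- rep(23, length(actual))

res <- actual - predicted   # residuals: observed - predicted
SSE <- sum(res^2)           # sum of the squared residuals
SSE                         # 88
```

The smaller this number is relative to the total variation (SSTO), the better the constant 23-minute model describes the commute.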
What is a Sum of Squares Regression (SSR)?
The SSR is the measurement of how much the regression line deviates from the average y-value (the overall mean). This can also be explained as the amount of variability EXPLAINED by the model, showing how far our predicted y-values deviate from the overall mean.
This is calculated with the following formula:
\[SSR = \underbrace{\sum_{i = 1}^n}_\text{The sum of} (\underbrace{\hat{Y_i}}_\text{Predicted Y (The Line)} - \underbrace{\bar{Y}}_\text{Average Y (Overall Mean)})^2\]
Why does SSR matter?
SSR matters because it tells us how good our predictions are.
It shows how much of what we're trying to predict can actually be explained by our model.
- A larger SSR means our predictions are more reliable and useful
- It helps us decide if our prediction method is worth using
Real life Comparison of SSR
Imagine predicting pizza delivery times:
The delivery app says:
SSR measures how much these categories actually help EXPLAIN delivery times.
Just like you can’t have “negative accuracy” in predictions, SSR can’t be negative. The bigger the SSR compared to total variation (SSTO), the better your prediction model is working.
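As a minimal sketch (using simulated data rather than the pizza example's actual categories), SSR can be computed directly from a fitted line's predictions and the overall mean:

```r
set.seed(7)                  # made-up illustrative data
x <- 1:15
y <- 5 + 1.5 * x + rnorm(15)
fit <- lm(y ~ x)

# SSR: how far the fitted (predicted) values sit from the overall mean of y
SSR  <- sum((fitted(fit) - mean(y))^2)
SSTO <- sum((y - mean(y))^2)
SSR / SSTO   # a large fraction means the line explains most of the variation
```

For simple linear regression with an intercept, this ratio is exactly `summary(fit)$r.squared`.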
What is a Sum of Squares Total (SSTO)?
The SSTO is the measurement of how much the y-values deviate from the average y-value. This can also be explained as the total variability in the data.
Key Relationship: SSTO = SSR + SSE -> Total Variation = Explained Variation + Unexplained Variation
This is calculated by the following:
\[SSR + SSE = SSTO = \underbrace{\sum_{i=1}^n}_\text{The sum of} (\underbrace{Y_i}_\text{Observed Y Values (The Dots)} - \underbrace{\bar{Y}}_\text{Average Y (Overall Mean)})^2\]
Why does SSTO matter?
The total variation (SSTO) helps us to know if our predictions are actually useful or just lucky guesses!
Real Life Comparison of SSTO
Imagine you own a coffee shop and want to understand your daily sales patterns:
Total Variation (SSTO):
This total variation can be broken into two parts:
The better your prediction model, the more of your total variation (SSTO) is explained by your model (SSR), and the less remains unexplained (SSE).
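The decomposition SSTO = SSR + SSE can be verified numerically. This is a small sketch with made-up "daily sales" style numbers, not the coffee shop's real data:

```r
set.seed(1)                               # made-up illustrative data
x <- 1:12
y <- 100 + 4 * x + rnorm(12, sd = 5)
fit <- lm(y ~ x)

SSTO <- sum((y - mean(y))^2)              # total variation
SSR  <- sum((fitted(fit) - mean(y))^2)    # explained variation
SSE  <- sum(residuals(fit)^2)             # unexplained variation
all.equal(SSR + SSE, SSTO)                # TRUE: the pieces add up
```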
What is R-squared?
The R-squared is the proportion of variability in Y that can be explained by the regression.
\[R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO} \]
Correlation:
Direction of relationship:
Why does R Squared matter?
It tells us how reliable our predictions are! Additionally, it shows us how confident we can be in those predictions (using the direction and strength of the correlation).
R-squared VS P-value
We can further understand R-squared by how it differs from the p-value for slope:
Real Life Comparison for R Squared
Imagine trying to predict ice cream sales based on temperature:
R-squared tells you how much temperature actually explains ice cream sales:
If R-squared = 0.80 (or 80%):
- Temperature explains 80% of why ice cream sales go up or down
- The other 20% might be due to other factors like holidays or promotions
Correlation in this scenario works like this:
- Perfect positive correlation (+1.00)
- Strong positive correlation (around +0.80)
- No correlation (0)
- Negative correlation (toward -1.00)
Think of it as how confident you can be in your predictions based on the relationship between two things.
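Both formulas for R-squared, along with its link to correlation, can be checked in R. The temperature and sales numbers here are simulated for illustration:

```r
set.seed(3)
temp  <- 60:79                           # made-up temperatures (°F)
sales <- 2 * temp + rnorm(20, sd = 10)   # made-up ice cream sales
fit <- lm(sales ~ temp)

SSTO <- sum((sales - mean(sales))^2)
SSR  <- sum((fitted(fit) - mean(sales))^2)
SSE  <- sum(residuals(fit)^2)

r2 <- SSR / SSTO
all.equal(r2, 1 - SSE / SSTO)           # the two formulas agree
all.equal(r2, cor(temp, sales)^2)       # R^2 is the squared correlation
all.equal(r2, summary(fit)$r.squared)   # matching R's own output
```

The squared-correlation shortcut only holds for simple linear regression (one X variable).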
What is the Mean Squared Error (MSE) & the “Residual Standard Error”(RSE)?
The MSE is the measurement of the average squared difference between predicted and actual values (averaged over n - p degrees of freedom rather than n).
- Can be any non-negative number (0 to infinity)
- Units are the squared units of the original data (e.g., degrees Fahrenheit²)
\[MSE = \frac{SSE}{n-p}\]
Relationship to R-squared
| MSE | R-Squared |
|---|---|
| measures prediction error | measures variance explained |
| between 0 and infinity | between 0 and 1 (0% - 100%) |
| units are squared units of the original data | unitless |
The Residual Standard Error (RSE) is the square root of the MSE.
- Found in R's regression summary output
- Uses the same units as the original data (e.g., degrees Fahrenheit)
\[RSE = \sqrt{MSE} = \sqrt{\frac{SSE}{n-p}}\]
Why do the MSE and the RSE matter?
Together, they indicate the fit of our model:
Real Life Comparison of MSE and Residual Error
Think of predicting daily temperatures:
The MSE would be like measuring how far off your temperature predictions are on average, but with the errors squared.
- If you predict 75°F and it's actually 73°F, that's a difference of 2°F, which gets squared to 4°F²
- The MSE would be the average of all these squared differences
The Residual Standard Error (RSE) converts this back to the original temperature units by taking the square root. Instead of 4°F², you get a value in °F, which makes it more intuitive to understand how far off your predictions typically are.
Lower values in both cases mean your temperature predictions are more accurate! In other words, you're better at forecasting the actual temperatures that occur (like a psychic).
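Both quantities can be pulled out of a fitted model by hand and checked against R's built-in `sigma()`. The data below are simulated stand-ins for daily temperatures:

```r
set.seed(9)
day  <- 1:14
temp <- 60 + 1.2 * day + rnorm(14, sd = 2)  # made-up daily highs (°F)
fit <- lm(temp ~ day)

n <- length(temp); p <- 2          # p = number of estimated coefficients
SSE <- sum(residuals(fit)^2)
MSE <- SSE / (n - p)               # in squared °F
RSE <- sqrt(MSE)                   # back in °F
all.equal(RSE, sigma(fit))         # matches "Residual standard error" in summary(fit)
```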
Application on Weather Prediction Analysis
For this study, we were tasked with predicting the “Actual Maximum Air Temperature” for this coming Monday, January 13th at BYU-Idaho. BYU-Idaho is located in the city of Rexburg, Idaho, and thus we will use this city’s weather recordings from timeanddate.com to make our predictions.
janweather <- read.csv("C:/Users/paige/OneDrive/Documents/Fall Semester 2024/MATH 325/Statistics-Notebook-master/Data/JanWeather.csv")
prediction <- data.frame(
STARTMAXTEMP=16,
MAXTEMP= 26,
label = "Prediction Point : 26°F"
)
janweathery_plot <- ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
aes(
text = paste(
"Date:", DATE, "<br>",
"Start Max Temp. of the Day:", STARTMAXTEMP, "\u00b0F<br>",
"Max Temp. of the Day:", MAXTEMP, "\u00b0F"
)
),
size = 2,
color = "darkblue"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "dodgerblue") +
labs(
title = "Weather Patterns from January 13th's of the Past",
x = "Max Start Temperature of the Day (\u00b0F)",
y = "Max Temperature of the Day (\u00b0F)"
) +
geom_point(data=prediction,
aes(x=STARTMAXTEMP, y=MAXTEMP),
size = 3,
color= "red") +
geom_text(
data = prediction,
aes(x = STARTMAXTEMP, y=MAXTEMP, label = label),
nudge_x = -7,
nudge_y = 3.6,
color= "red",
size = 3
) +
theme_minimal()
ggplotly(janweathery_plot, tooltip = "text")
This is our mathematical model: \[\underbrace{Y_i}_\text{MAXTEMP} = \overbrace{\beta_0}^\text{Intercept} + \overbrace{\beta_1}^\text{Slope} \underbrace{X_i}_\text{STARTMAXTEMP} + \epsilon_i \text{ where } \epsilon_i \sim N(0,\sigma^2)\]
This is our Simple Linear Regression test:
janlm <- lm(MAXTEMP ~ STARTMAXTEMP, data=janweather)
summary(janlm)%>%
pander()
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 13.68 | 2.583 | 5.297 | 0.001835 |
| STARTMAXTEMP | 0.743 | 0.1214 | 6.119 | 0.0008698 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 8 | 4.275 | 0.8619 | 0.8389 |
Using this study, we will now go further in depth with how residuals apply to our predictions.
What does the residual tell us about our predicted temperature for Monday January 13th?
As a reminder, residuals are the difference between the observed value (\(Y_i\)) and the predicted value (\(\hat{Y_i}\)).
In the context of this study, the residual of a given point is the difference between the observed MAXTEMP and the predicted MAXTEMP. This can be depicted as:
\[Residual = \text{Observed MAXTEMP} - \text{Predicted MAXTEMP}\]
Below is the table of residuals for all 8 of the points used in this data set.
pander(janlm$residuals)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| -3.683 | -2.259 | 4.943 | 3.026 | -6.745 | 3.088 | 1.54 | 0.08834 |
| Residual Value | Meaning |
|---|---|
| Positive residual (+) | the predicted MAXTEMP is lower than the observed MAXTEMP (aka. an under-prediction) |
| Negative residual (-) | the predicted MAXTEMP is higher than the observed MAXTEMP (aka. an over-prediction) |
| Close to 0 | the predicted MAXTEMP is very close to the observed MAXTEMP (aka. a good-fit prediction) |
janweather$predicted_MAXTEMP <- predict(janlm)
# Calculate residuals (difference between actual and predicted values)
janweather$residuals <- janweather$MAXTEMP - janweather$predicted_MAXTEMP
# Plot with regression line and residuals
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "pink"
) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") + # Regression line
# Add vertical lines representing residuals
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = MAXTEMP),
color = "pink", linetype = "solid", linewidth = 0.8) + # Residuals (error lines)
labs(title = "Residuals of Weather Prediction Analysis") +
theme_minimal()
How do the SSE, SSR, and SSTO apply to this study?
These values are depicted below:
janweather$predicted_MAXTEMP <- predict(janlm)
janweather$residuals <- janweather$MAXTEMP - janweather$predicted_MAXTEMP
SSTO <- sum((janweather$MAXTEMP - mean(janweather$MAXTEMP))^2)
SSR <- sum((janweather$predicted_MAXTEMP - mean(janweather$MAXTEMP))^2)
SSE <- sum(janweather$residuals^2)
cat("SSE:", round(SSE, 2), "\n")
SSE: 109.66
cat("SSR:", round(SSR, 2), "\n")
SSR: 684.34
cat("SSTO:", round(SSTO, 2), "\n")
SSTO: 794
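As a sanity check, the three reported values satisfy the decomposition SSTO = SSR + SSE (up to rounding):

```r
SSE  <- 109.66   # values reported above
SSR  <- 684.34
SSTO <- 794
all.equal(SSR + SSE, SSTO)   # TRUE
```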
Here is how these concepts apply:
| Concept | Meaning | Application |
|---|---|---|
| Sum of Squared Errors (SSE) | measures the unexplained variation in the data | how much of the variation in MAXTEMP is NOT explained by the relationship with STARTMAXTEMP. We want our SSE to be small compared to our SSTO; with an SSE of 109.66 out of a total of 794, relatively little variability is left unexplained, indicating our model is a good fit |
| Sum of Squared Regression (SSR) | measures the explained variation in the data | how much of the variation in MAXTEMP IS explained by the relationship with STARTMAXTEMP. We want our SSR to be large relative to the SSTO; with an SSR of 684.34, the model does a good job of explaining the variability in MAXTEMP and is a good fit for our data |
| Sum of Squares Total (SSTO) | measures the total variation in the data, combining the explained and unexplained parts | the total variability in MAXTEMP |
Graphs of the SSR, SSE, and SSTO
SSR Graph
mean_MAXTEMP <- mean(janweather$MAXTEMP)
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "black") +
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = mean_MAXTEMP),
color = "green", linetype = "dashed", linewidth = 0.8) +
geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", linewidth = 0.8) +
labs(title = "SSR of Weather Prediction Analysis") +
theme_minimal()
SSE Graph
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "black") +
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = MAXTEMP, yend = predicted_MAXTEMP),
color = "red", linetype = "dotted", linewidth = 1) +
geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", linewidth = 0.8) +
labs(title = "SSE of Weather Prediction Analysis") +
theme_minimal()
SSTO Graph
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
)+
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = MAXTEMP, yend = mean_MAXTEMP),
color = "blue", linetype = "dotted", linewidth = 1) +
geom_hline(yintercept = mean_MAXTEMP, color = "blue", linetype = "dotted", linewidth = 1) +
labs(title = "SSTO of Weather Prediction Analysis") +
theme_minimal()
All Together
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "grey"
) +
geom_smooth(method = "lm", formula= y~x, se = FALSE, color = "lightblue") +
geom_segment(aes(x = STARTMAXTEMP + 0.1, xend = STARTMAXTEMP + 0.1, y = predicted_MAXTEMP, yend = mean_MAXTEMP),
color = "green", linetype = "dashed", linewidth = 0.8) +
geom_segment(aes(x = STARTMAXTEMP + 0.2, xend = STARTMAXTEMP + 0.2, y = MAXTEMP, yend = predicted_MAXTEMP),
color = "red", linetype = "dotted", linewidth = 1) +
geom_segment(aes(x = STARTMAXTEMP + 0.3, xend = STARTMAXTEMP + 0.3, y = MAXTEMP, yend = mean_MAXTEMP),
color = "blue", linetype = "dotted", linewidth = 1) +
geom_hline(yintercept = mean_MAXTEMP, color = "blue", linetype = "dotted", linewidth = 2) +
geom_hline(yintercept = mean_MAXTEMP, color = "grey", linetype = "solid", linewidth = 0.8) +
labs(title = "SSR, SSE, and SSTO of Weather Prediction Analysis") +
theme_minimal()
What does R Squared offer to this study?
In this study, R Squared explains how well our independent variable, STARTMAXTEMP, explains/predicts our dependent variable, MAXTEMP.
You can find our R Squared value by either computing in the equation below or by looking in our Simple Linear Regression Test under \(R^2\).
\[R^2 = \frac{SSR}{SSTO} = \frac{684.34}{794} = 0.8619 \]
janlm <- lm(MAXTEMP ~ STARTMAXTEMP, data=janweather)
summary(janlm)%>%
pander()
|  | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | 13.68 | 2.583 | 5.297 | 0.001835 |
| STARTMAXTEMP | 0.743 | 0.1214 | 6.119 | 0.0008698 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 8 | 4.275 | 0.8619 | 0.8389 |
With this value, we can interpret our 0.8619 \(R^2\) value using the following table:
| \(R^2\) Value | Interpretation |
|---|---|
| close to 1 | Perfect fit; nearly all of the variability in MAXTEMP is explained by STARTMAXTEMP |
| around 0 | Not a good fit; does not explain ANY variability in MAXTEMP, and there is no relationship between the two variables |
| 0.8619 (ours) | about 86.19% of the variability in MAXTEMP can be explained with STARTMAXTEMP |
ggplot(janweather, aes(x = STARTMAXTEMP, y = MAXTEMP)) +
geom_point(
size = 2,
color = "purple"
) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") + # Regression line
# Add vertical lines representing residuals
geom_segment(aes(x = STARTMAXTEMP, xend = STARTMAXTEMP, y = predicted_MAXTEMP, yend = MAXTEMP),
color = "purple", linetype = "solid", linewidth = 0.8) +
# Shade squares whose side length is each residual (the "squared" in squared error)
geom_rect(aes(xmin = STARTMAXTEMP, xmax = STARTMAXTEMP + residuals, ymin = MAXTEMP, ymax = predicted_MAXTEMP),
alpha = 0.3, color = "purple", fill = "purple") +
labs(title = "Squared Residuals of Weather Prediction Analysis") +
theme_minimal()
How do the MSE and the "Residual Standard Error" (RSE) apply to this study?
Both the MSE and the "Residual Standard Error" help in assessing the accuracy and reliability of our weather prediction model.
- The MSE gives us an overall measure, in squared units, of our prediction error
- Lower MSE: the model does well at predicting the Y (MAXTEMP) from the X (STARTMAXTEMP)
- Higher MSE: the model does NOT do well at predicting the Y (MAXTEMP) from the X (STARTMAXTEMP), as the data does not fit well
- The "Residual Standard Error" gives us a measurement, in the original units, of how much error is present in our model's predictions
predictions <- predict(janlm)
SSE <- sum((janweather$MAXTEMP - predictions)^2)
MSE <- SSE / (nrow(janweather) - 2)  # SSE / (n - p), with p = 2 estimated coefficients
rse <- sqrt(MSE)
cat("MSE:", round(MSE, 2), "\n")
MSE: 18.28
cat("RSE:", round(rse, 2), "°F")
RSE: 4.28 °F
With these values we are able to deduce the following:
- MSE: the average of the squared differences, adjusted for degrees of freedom, is 18.28 °F²
- RSE: on average, the predicted MAXTEMP from our study is about 4.28°F from the actual values, matching the Residual Std. Error of 4.275 reported in the regression summary
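Plugging Monday's starting maximum temperature into the fitted equation reproduces the red prediction point from the first plot. This sketch uses the rounded coefficients from the regression table rather than the janlm object itself:

```r
b0 <- 13.68          # intercept (rounded, from the regression table)
b1 <- 0.743          # slope for STARTMAXTEMP
start_max <- 16      # Monday's starting max temperature (°F)

predicted_max <- b0 + b1 * start_max
round(predicted_max)   # about 26 °F, the "Prediction Point" on the plot
```

With the janweather data loaded, `predict(janlm, data.frame(STARTMAXTEMP = 16))` would give the same value without the rounding error introduced by the table's coefficients.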